Abstract

Blurring the line between software and hardware, reconfigurable
devices strike a balance between the raw high speed of custom silicon
and the post-fabrication flexibility of general-purpose processors.
While this flexibility is a boon for embedded system developers, who
can now rapidly prototype and deploy solutions with performance
approaching custom designs, it also results in a system development
methodology where functionality is stitched together from a variety
of “soft IP cores,” often provided by multiple vendors with different
levels of trust. Unlike traditional software where resources are
managed by an operating system, soft IP cores necessarily have very
fine grain control over the underlying hardware. To address this
problem, the embedded systems community requires novel security
primitives which address the realities of modern reconfigurable
hardware. We propose an isolation primitive, moats and drawbridges,
that is built around four design properties: logical isolation,
interconnect traceability, secure reconfigurable broadcast, and
configuration scrubbing. Each of these is a fundamental operation
with easily understood formal properties, yet maps cleanly and
efficiently to a wide variety of reconfigurable devices. We carefully
quantify the required overheads on real FPGAs and demonstrate the
utility of our methods by applying them to the practical problem of
memory protection.


1  Introduction 

Reconfigurable  hardware,  such  as  a  Field  Programmable 
Gate  Array  (FPGA),  provides  a  programmable  substrate 
onto  which  descriptions  of  circuits  can  be  loaded  and  exe¬ 
cuted  at  very  high  speeds.  Because  they  are  able  to  provide 
a  useful  balance  between  performance,  cost,  and  flexibil¬ 
ity,  many  critical  embedded  systems  make  use  of  FPGAs 
as  their  primary  source  of  computation.  For  example,  the 
aerospace  industry  relies  on  FPGAs  to  control  everything 
from  satellites  to  the  Mars  Rover.  Their  circuit-level  flexi¬ 
bility  allows  system  functionality  to  be  updated  arbitrarily 
and  remotely.  Real-time  and  military  projects,  such  as  the 
Joint  Strike  Fighter,  make  frequent  use  of  FPGAs  because 
they  provide  both  high-performance  and  well-defined  tim¬ 
ing  behavior,  but  they  do  not  require  the  costly  fabrication 
of  custom  chips. 

FPGA technology is now the leading design driver for
almost every single foundry,¹ meaning that FPGAs enjoy the
benefits  of  production  on  a  massive  scale  (reduced  cost,  bet¬ 
ter  yield,  difficult  to  tamper  with),  yet  developers  are  free 
to  deploy  their  own  custom  circuit  designs  by  configuring 
the  device  in  the  appropriate  ways.  This  has  significantly 
lowered  the  primary  impediment  to  hardware  development, 
cost,  and  as  such  we  are  now  seeing  an  explosion  of  recon¬ 
figurable  hardware  based  designs  in  everything  from  face 

1 A  foundry  is  a  wafer  production  and  processing  plant  available  on  a 
contract  basis  to  companies  that  do  not  have  wafer  fab  capability  of  their 
recognition systems [39], to wireless networks [42], to intrusion
detection systems [20], to supercomputers [5]. In fact, it is
estimated that in 2005 alone over 80,000 different commercial
FPGA design projects were started [36].
Unfortunately, while the economics of the semiconductor
industry have helped to drive the widespread adoption of
reconfigurable devices in a variety of critical systems, it is
not  yet  clear  that  such  devices,  and  the  design  flows  used  to 
configure  them,  are  actually  trustworthy. 

Reconfigurable  systems  are  typically  cobbled  together 
from a collection of existing modules (called cores) in order
to  save  both  time  and  money.  Although  ideally  each  of  these 
cores  would  be  formally  specified,  tested,  and  verified  by  a 
highly  trusted  party,  in  reality,  such  a  development  model 
cannot  hope  to  keep  up  with  the  exponential  increases  in  cir¬ 
cuit  area  and  performance  made  possible  by  Moore’s  Law. 
Unlike  uni-processor  software  development,  where  the  pro¬ 
gramming  model  remains  fixed  as  transistor  densities  in¬ 
crease,  FPGA  developers  must  explicitly  take  advantage  of 
denser  devices  through  changes  in  their  design.  Given  that 
embedded  design  is  driven  in  large  part  by  the  demand  for 
new  features  and  the  desire  to  exploit  technological  scaling 
trends,  there  is  a  constant  pressure  to  mix  everything  on  a 
single  chip:  from  the  most  critical  functionality  to  the  latest 
fad.  Each  of  these  cores  runs  “naked”  on  the  reconfigurable 
device  (i.e.,  without  the  benefit  of  an  operating  system  or 
other  intermediate  layer),  and  it  is  possible  that  this  mixing 
of  trust  levels  could  be  silently  exploited  by  an  adversary 
with access to any point in the design flow (including design
tools or implemented cores). In an unrestricted design flow,
even answering the question “are these two cores capable of
communication?” is computationally difficult.

Consider  a  more  concrete  example,  a  system  with  two 
soft-processor  cores  and  an  AES  encryption  engine  shar¬ 
ing  a  single  FPGA.  Each  of  these  three  cores  requires  ac¬ 
cess  to  off-chip  memory  to  store  and  retrieve  data.  How 
can  we  ensure  that  the  encryption  key  for  one  of  the  pro¬ 
cessors  cannot  be  obtained  by  the  other  processor  by  either 
reading  the  key  from  external  memory  or  directly  from  the 
encryption  core  itself?  There  is  no  virtual  memory  on  these 
systems, and after being run through an optimizing CAD
tool  the  resulting  circuit  is  a  single  entangled  mess  of  gates 
and  wires.  To  prevent  the  key  from  being  read  directly  from 
the  encryption  core  itself,  we  must  find  some  way  to  iso¬ 
late  the  encryption  engine  from  the  other  cores  at  the  gate 
level.  To  protect  the  key  in  external  memory,  we  need  to 
implement  a  memory  protection  module,  we  need  to  en¬ 
sure  that  each  and  every  memory  access  goes  through  this 
monitor,  and  we  need  to  ensure  that  all  cores  are  commu¬ 
nicating  only  through  their  specified  interfaces.  To  ensure 
these properties hold at even the lowest levels of implementation
(after all the design tools have finished their transformations),
we argue that slight modifications in the design methods and
tools can enable the rapid static verification of finished FPGA
bitstreams². The techniques presented in
this  paper  are  steps  towards  a  cohesive  reconfigurable  sys¬ 
tem  design  methodology  that  explicitly  supports  cores  with 
varying  levels  of  trust  and  criticality  -  all  sharing  a  single 
physical  device. 

Specifically,  we  present  the  idea  of  Moats  and  Draw¬ 
bridges,  a  statically  verifiable  method  to  provide  isolation 
and  physical  interface  compliance  for  multiple  cores  on  a 
single  reconfigurable  chip.  The  key  idea  of  the  Moat  is  to 
provide  logical  and  physical  isolation  by  separating  cores 
into  different  areas  of  the  chip  with  “dead”  channels  be¬ 
tween  them  that  can  be  easily  verified.  Note  that  this  does 
not  require  a  specialized  physical  device;  rather,  this  work 
only  assumes  the  use  of  commercially  available  commodity 
parts.  Given  that  we  need  to  interconnect  our  cores  at  the 
proper  interfaces  (Drawbridges),  we  introduce  interconnect 
tracing  as  a  method  for  verifying  that  interfaces  carrying 
sensitive  data  have  not  been  tapped  or  routed  improperly  to 
other  cores  or  I/O  pads.  Furthermore,  we  present  a  tech¬ 
nique,  configuration  scrubbing,  for  ensuring  that  remnants 
of  a  prior  core  do  not  linger  following  a  partial  reconfigura¬ 
tion  of  the  system  to  enable  object  reuse.  Once  we  have  a 
set  of  drawbridges,  we  need  to  enable  legal  inter-core  com¬ 
munication.  We  describe  two  secure  reconfigurable  com¬ 
munication  architectures  that  can  be  easily  mapped  into  the 
unused  moat  areas  (and  statically  checked  for  isolation),  and 
we  quantify  the  implementation  trade-offs  between  them 
in  terms  of  complexity  of  analysis  and  performance.  Fi¬ 
nally,  to  demonstrate  the  efficacy  of  our  techniques,  we  ap¬ 
ply  them  to  a  memory  protection  scheme  that  enforces  the 
legal  sharing  of  off-chip  memory  between  multiple  cores. 

2  Reconfigurable  Systems 

As  mentioned  in  Section  1,  a  reconfigurable  system  is 
typically  constructed  piecemeal  from  a  set  of  existing  mod¬ 
ules  (called  cores)  in  order  to  save  both  time  and  money; 
rarely  does  one  design  a  full  system  from  scratch.  One 
prime  example  of  a  module  that  is  used  in  a  variety  of  con¬ 
texts  is  a  soft-processor.  A  soft-processor  is  simply  a  con¬ 
figuration  of  logical  gates  that  implements  the  functionality 
of  a  processor  using  the  reconfigurable  logic  of  an  FPGA. 
A soft-processor, and other intellectual property (IP) cores³
such  as  AES  implementations  and  Ethernet  controllers,  can 

2 Bitstreams are the detailed configuration files that encode
the exact implementation of a circuit on reconfigurable hardware. In many
ways they are analogous to a statically linked executable on a traditional
microprocessor.

3  Since  designing  reconfigurable  modules  is  costly,  companies  have 
developed  several  schemes  to  protect  this  valuable  intellectual  property, 
which  we  discuss  in  Section  6. 


be  assembled  together  to  implement  the  desired  function¬ 
ality.  Cores  may  come  from  design  reuse,  but  more  often 
than  not  they  are  purchased  from  third  party  vendors,  gen¬ 
erated  automatically  as  the  output  of  some  design  tool,  or 
even  gathered  from  open  source  repositories.  While  indi¬ 
vidual  cores  such  as  encryption  engines  may  be  formally 
verified  [30],  a  malicious  piece  of  logic  or  compromised 
design  tool  may  be  able  to  exploit  low  level  implementa¬ 
tion  details  to  quietly  eavesdrop  on,  or  interfere  with,  trusted 
logic.  As  a  modern  design  may  implement  millions  of  logi¬ 
cal  gates  with  tens  of  millions  of  interconnections,  the  goal 
of  this  paper  is  to  explore  design  techniques  that  will  allow 
the  inclusion  of  both  trusted  and  untrusted  cores  on  a  single 
chip,  without  the  requirement  that  expensive  static  verifica¬ 
tion  be  employed  over  the  entire  finished  design.  Such  ver¬ 
ification  of  a  large  and  complex  design  requires  reverse  en¬ 
gineering,  which  is  highly  impractical  because  many  com¬ 
panies  keep  details  about  their  bit-streams  proprietary. 

Increasingly  we  are  seeing  reconfigurable  devices 
emerge  as  the  flexible  and  high-performance  workhorses 
inside  a  variety  of  high  performance  embedded  computing 
systems [4, 9, 11, 22, 35, 45], but to understand the potential
security  issues,  we  need  to  build  on  an  understanding  of  at 
least  a  simplified  modern  FPGA  design  flow.  In  this  section 
we  describe  a  modern  device,  a  typical  design  flow,  and  the 
potential  threats  that  our  techniques  are  expected  to  handle. 

2.1  Reconfigurable  Hardware 

FPGAs  lie  along  a  continuum  between  general-purpose 
processors  and  application-specific  integrated  circuits 
(ASICs).  While  general  purpose  processors  can  execute  any 
program,  this  generality  comes  at  the  cost  of  serialized  ex¬ 
ecution.  On  the  other  hand,  ASICs  can  achieve  impressive 
parallelism,  but  their  function  is  literally  hard  wired  into  the 
device.  The  power  of  reconfigurable  systems  lies  in  their 
ability  to  flexibly  customize  an  implementation  down  at  the 
level  of  individual  bits  and  logic  gates  without  requiring 
a custom piece of silicon. This can often result in performance
improvements on the order of 100x, per unit of silicon, as
compared to a similar microprocessor [7, 10, 50].

The  growing  popularity  of  reconfigurable  logic  has 
forced  practitioners  to  begin  to  consider  security  implica¬ 
tions,  but  as  of  yet  there  is  no  set  of  best  design  practices  to 
guide  their  efforts.  Furthermore,  the  resource  constrained 
nature  of  embedded  systems  is  perceived  to  be  a  challenge 
to  providing  a  high  level  of  security  [26].  In  this  paper 
we  describe  a  set  of  low  level  methods  that  a)  allow  effec¬ 
tive  reasoning  about  high  level  system  properties,  b)  can  be 
supported  with  minimal  changes  to  existing  tool  flows,  c) 
can  be  statically  verified  with  little  effort,  d)  incur  relatively 
small  area  and  performance  overheads,  and  e)  can  be  used 
with commercial off-the-shelf parts. The advantage of developing
security primitives for FPGAs is that we can immediately
incorporate our primitives into the reconfigurable design flow
today, and we are not dependent on the often reluctant industry
to modify the design of their silicon.

2.2  Mixed-Trust  Design  Flows 

Figure  1  shows  a  few  of  the  many  different  design  flows 
used  to  compose  a  single  modern  embedded  system.  The 
reconfigurable  implementation  relies  on  a  large  number  of 
sophisticated  software  tools  that  have  been  created  by  many 
different people and organizations. Soft IP cores, such as an
AES core, can be distributed in the form of Hardware Description
Language (HDL), netlists,⁴ or a bitstream. These cores can be
designed by hand, or they can be automatically generated by
computer programs. For example, the Xilinx Embedded Development
Kit (EDK) [53] software tool generates soft microprocessors from
C code. Similarly, Accel DSP [17] translates MATLAB [48]
algorithms into HDL; logic synthesis translates this HDL into a
netlist; and a synthesis tool uses a place-and-route algorithm to
convert this netlist into a bitstream, with the final result being
an implementation of a specialized signal processing core.

Given  that  all  of  these  different  design  tools  produce  a  set 
of  inter-operating  cores,  you  can  only  trust  your  final  system 
as  much  as  you  trust  your  least-trusted  design  path.  If  there 
is  a  critical  piece  of  functionality,  e.g.  a  unit  that  protects 
and  operates  on  secret  keys,  there  is  no  way  to  verify  that 
this core cannot be snooped on or tampered with without a set of
isolation  strategies. 

The  subversion  of  design  tools  could  easily  result  in  ma¬ 
licious  hardware  being  loaded  onto  the  device.  In  fact,  ma¬ 
jor  design  tool  developers  have  few  or  no  checks  in  place 
to  ensure  that  attacks  on  specific  functionality  are  not  in¬ 
cluded.  However,  just  to  be  clear,  we  are  not  proposing 
a  method  that  makes  possible  the  use  of  subverted  design 
tools  on  a  trusted  core.  Rather,  we  are  proposing  a  method 
by  which  small  trusted  cores,  developed  with  trusted  tools 
(perhaps  using  in-house  tools  which  are  not  fully  optimized 
for performance⁵) can be safely combined with untrusted
cores. 

2.3  Motivating  Examples 

We  have  already  discussed  the  example  of  a  system  with 
two  processor  cores  and  an  encryption  core.  The  goal  of  our 
methods  is  to  prevent  the  encryption  key  for  one  of  the  pro¬ 
cessors  from  being  obtained  by  the  other  processor  by  either 

4 Essentially a list of logical gates and their interconnections.

5 FPGA manufacturers such as Xilinx provide signed cores that can be
trusted  by  embedded  designers,  while  those  freely  available  cores  obtained 
from  sources  such  as  OpenCores  are  considered  to  be  less  trustworthy.  The 
development  of  a  trusted  tool  chain  or  a  trusted  core  is  beyond  the  scope 
of  this  paper. 
Figure 1. A Modern FPGA-based Embedded System: Distinct cores with different pedigrees and
varied trust requirements find themselves occupying the same silicon. Reconfigurable logic, hard
and soft processor cores, blocks of SRAM, and other soft IP cores all share the FPGA and the same
off-chip  memory.  How  can  we  ensure  that  the  encryption  key  for  one  of  the  processors  cannot  be 
obtained  by  the  other  processor  by  either  reading  the  key  from  external  memory  or  directly  from  the 
encryption  core  itself? 


reading  the  key  from  external  memory  or  directly  from  the 
encryption  core  itself. 

Aviation  -  Both  military  and  commercial  sectors  rely 
on  commercial  off-the-shelf  (COTS)  reconfigurable  com¬ 
ponents  to  save  time  and  money.  Consider  the  example 
of  avionics  in  military  aircraft  in  which  sensitive  target¬ 
ing  data  is  processed  on  the  same  device  as  less  sensitive 
maintenance  data.  In  such  military  hardware  systems,  cer¬ 
tain  processing  components  are  “cleared”  for  different  lev¬ 
els  of  data.  Since  airplane  designs  must  minimize  weight,  it 
is  impractical  to  have  a  separate  device  for  every  function. 
Our  security  primitives  can  facilitate  the  design  of  military 
avionics  by  providing  separation  of  modules  that  must  be 
integrated  onto  a  single  device. 

Computer  Vision  -  In  the  commercial  world,  consider  a 
video  surveillance  system  that  has  been  designed  to  protect 
privacy.  Intelligent  video  surveillance  systems  can  iden¬ 
tify  human  behavior  that  is  potentially  suspicious,  and  this 
behavior  can  be  brought  to  the  attention  of  a  human  op¬ 
erator  to  make  a  judgment  [40]  [21].  IBM’s  People  Vision 
project  has  been  developing  such  a  video  surveillance  sys¬ 
tem  [46]  that  protects  the  privacy  of  individuals  by  blurring 
their  faces  depending  on  the  credentials  of  the  viewer  (e.g., 


security  guards  vs.  maintenance  technicians).  FPGAs  are 
a  natural  choice  for  any  streaming  application  because  they 
can  provide  deep  regular  pipelines  of  computation,  with  no 
shortage  of  parallelism.  Implementing  such  a  system  would 
require  at  least  three  cores  on  the  FPGA:  a  video  interface 
for  decoding  the  video  stream,  a  redaction  mechanism  for 
blurring  faces  in  accordance  with  a  policy,  and  a  network 
interface  for  sending  the  redacted  video  stream  to  the  se¬ 
curity  guard’s  station.  Each  of  these  modules  would  need 
buffers  of  off-chip  memory  to  function,  and  our  methods 
could  prevent  sensitive  information  from  being  shared  be¬ 
tween  modules  improperly  (e.g.  directly  between  the  video 
interface  and  the  network).  While  our  techniques  could  not 
verify  the  correct  operation  of  the  redaction  core,  they  could 
ensure  that  only  the  connections  necessary  for  legal  com¬ 
munication  between  cores  are  made. 

Now  that  we  have  described  a  high  level  picture  of  the 
problem  we  are  attempting  to  address,  we  present  our  two 
concepts,  moats  and  drawbridges,  along  with  the  details  of 
how  each  maps  to  a  modern  reconfigurable  device.  In  par¬ 
ticular,  for  each  approach  we  specify  the  threats  that  it  ad¬ 
dresses,  the  details  of  the  technique  and  its  implementation, 
and  the  overheads  involved  in  its  use.  Finally,  in  Section  5, 
we  show  how  these  low-level  protection  mechanisms  can  be 
used in the implementation of a higher-level memory
protection primitive.

3  Physical  Isolation  with  Moats 

As  discussed  in  Section  2,  a  strong  notion  of  isolation 
is  lacking  in  current  reconfigurable  hardware  design  flows, 
yet  one  is  needed  to  be  certain  that  cores  are  not  snooping 
on  or  interfering  with  each  other.  Before  we  can  precisely 
describe  the  problem  that  moats  attempt  to  solve,  we  need 
to  begin  with  a  brief  description  of  how  routing  works  (and 
the  function  it  serves)  in  a  modern  FPGA. 

On  a  modern  FPGA,  the  vast  majority  of  the  actual  sili¬ 
con  area  is  taken  up  by  interconnect  (approximately  90%). 
The  purpose  of  this  interconnect  is  to  make  it  easy  to  con¬ 
nect  logical  elements  together  so  that  any  circuit  can  be  re¬ 
alized.  For  example,  the  output  of  one  NAND  gate  may  be 
routed  to  the  input  of  another,  or  the  address  wires  from  a 
soft-processor may be routed to an I/O pad connected to
external memory. The routing is completely static: a virtual
wire  is  created  from  input  to  output,  but  that  signal  may  be 
routed  to  many  different  places  simultaneously  (e.g.,  one 
output  to  many  inputs  or  vice  versa). 

The  rest  of  the  FPGA  is  a  collection  of  programmable 
gates  (implemented  as  small  lookup-tables  called  LUTs), 
flip-flops  for  timing  and  registers,  and  I/O  blocks  (IOB)  for 
transferring  data  into  and  out  of  the  device.  A  circuit  can 
be mapped to an FPGA by loading the LUTs and switch boxes
with a configuration, a method that is analogous to
the  way  a  traditional  circuit  might  be  mapped  to  a  set  of 
logical  gates.  An  FPGA  is  programmed  using  a  bitstream. 
This  binary  data  is  loaded  into  the  FPGA  to  execute  a  partic¬ 
ular  task.  The  bitstream  contains  all  the  information  needed 
to  provide  a  functional  device,  such  as  the  configuration  in¬ 
terface  and  the  internal  clock  cycle  supported  by  the  device. 

Without an isolation primitive, it is very difficult to prevent
a connection between two cores from being established.
Place-and-route software uses performance as an
objective function in its optimization strategy, which can
result in the logical elements and the interconnections of
two cores being intertwined. Figure 3 makes the scope of
the problem clearer. The left-hand side of Figure 3 shows
the  floor  plan  of  an  FPGA  with  two  small  cores  (soft  pro¬ 
cessors)  mapped  onto  it.  The  two  processors  overlap  sig¬ 
nificantly  in  several  areas  of  the  chip.  Ensuring  that  the 
two  never  communicate  requires  that  we  trace  every  single 
wire  to  ensure  that  only  the  proper  connections  are  made. 
Such  verification  of  a  large  and  complex  design  requires 
reverse  engineering,  which  is  highly  impractical  because 
many companies keep the necessary details about their
bitstreams secret. With moats, fewer proprietary details about
the bitstream are needed to accomplish this verification.
The difficulty of this problem is made more clear by the
zoom-in on the right of Figure 3. The zoom-in shows a
single switch box, the associated LUTs (to the right of the
switch  box),  and  all  the  wires  that  cross  through  that  one 
small  portion  of  the  chip.  A  modern  FPGA  contains  on  the 
order  of  20,000  or  more  such  boxes. 

Isolation  is  required  in  order  to  protect  the  confidential¬ 
ity  and  integrity  of  a  core’s  data,  and  helps  to  prevent  inter¬ 
ference  with  a  core’s  functionality.  Our  technique  allows  a 
very  simple  static  check  to  verify  that,  at  least  at  the  routing 
layer,  the  cores  are  sufficiently  isolated. 

3.1  Building  Moats 

Moats  are  a  novel  method  of  enhancing  the  security  of 
FPGA  systems  via  the  physical  isolation  of  cores.  Our 
approach  involves  surrounding  each  core  with  a  “moat” 
that  blocks  wiring  connectivity  from  the  outside.  The  core 
can  only  communicate  with  the  outside  world  via  a  “draw¬ 
bridge”,  which  is  a  precisely  defined  path  to  the  outside 
world. 

One  straightforward  way  to  accomplish  this  is  to  align 
the  routing  tracks  used  by  each  of  these  modules  and  simply 
disable  the  switches  near  the  moat  boundaries.  The  prob¬ 
lem  with  this  simple  approach  is  that,  for  the  purposes  of 
improving  area  and  timing  efficiency,  modern  FPGA  archi¬ 
tectures  often  support  staggered,  multiple  track  segments. 
For  example,  the  Virtex  platform  supports  track  segments 
with  lengths  1,  2  and  6,  where  the  length  is  determined 
by measuring the number of Configurable Logic Blocks
(CLBs)  the  segment  crosses.  For  example,  a  length  6  seg¬ 
ment  will  span  6  CLBs,  providing  a  more  direct  connec¬ 
tion  by  skipping  unnecessary  switch  boxes  along  the  rout¬ 
ing  path.  Moreover,  many  platforms  such  as  Virtex  support 
“longline”  segments,  which  span  the  complete  row  or  col¬ 
umn  of  the  CLB  array. 

Figure  4  illustrates  our  moat  architecture.  If  we  allow  the 
design  tool  to  make  use  of  segment  lengths  of  one  and  two, 
the  moat  size  must  be  at  least  two  segments  wide  in  order 
to  successfully  isolate  two  cores  (otherwise  signals  could 
hop  the  moats  because  they  would  not  require  a  switch  box 
in  the  moat).  To  statically  check  that  a  moat  is  sound,  the 
following  properties  are  sufficient. 

1. The target core is completely surrounded by a moat of
width at least w.

2. The target core does not make any use of routing
segments longer than length w.

In  fact,  both  of  these  properties  are  easy  to  inspect  on 
an  FPGA.  We  can  tell  if  a  switch  box  is  part  of  a  moat  by 
simply  checking  that  it  is  completely  dead  (i.e.,  all  the  rout¬ 
ing  transistors  are  configured  to  be  disconnected).  We  can 
check  the  second  property  by  examining  all  of  the  long  line 
switch  boxes  to  ensure  that  they  are  unused.  These  are  easy 
Figure 2. A simplified representation of an FPGA fabric is on the left. Configurable Logic Blocks
(CLBs) perform logic level computation using Lookup Tables (LUTs) for bit manipulations and
flip-flops for storage. The switch boxes and routing channels provide connections between the CLBs.
SRAM  configuration  bits  are  used  throughout  the  FPGA  (e.g.,  to  program  the  logical  function  of  the 
LUTs  and  connect  a  segment  in  one  routing  channel  to  a  segment  in  an  adjacent  routing  channel). 
The  FPGA  floor  plan  on  the  right  shows  the  layout  of  three  cores  -  notice  how  they  are  intertwined. 


to  find  because  they  are  tied  to  the  physical  FPGA  design 
and  are  not  a  function  of  the  specific  core  on  the  FPGA. 
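
The two properties above lend themselves to a mechanical check. The
following Python sketch is illustrative only and is not the checker used
in this work. It assumes a simplified model in which the fabric is a
grid of tiles with one switch box per tile, `live_switch_boxes` is the
set of tiles whose switch box has at least one connected routing
transistor, and `core_segment_lengths` lists the lengths of the routing
segments used by the core. All names are hypothetical, and the moat is
taken conservatively as every tile within w steps of the core in any
direction.

```python
from typing import Iterable, Set, Tuple

Tile = Tuple[int, int]  # (column, row) of a CLB and its switch box


def moat_tiles(core: Set[Tile], w: int, cols: int, rows: int) -> Set[Tile]:
    """Tiles on the chip within w steps of the core, excluding the core."""
    ring = set()
    for (x, y) in core:
        for dx in range(-w, w + 1):
            for dy in range(-w, w + 1):
                t = (x + dx, y + dy)
                on_chip = 0 <= t[0] < cols and 0 <= t[1] < rows
                if on_chip and t not in core:
                    ring.add(t)
    return ring


def moat_is_sound(core: Set[Tile],
                  live_switch_boxes: Set[Tile],
                  core_segment_lengths: Iterable[int],
                  w: int, cols: int, rows: int) -> bool:
    # Property 1: every switch box in the surrounding ring is dead.
    surrounded = moat_tiles(core, w, cols, rows).isdisjoint(live_switch_boxes)
    # Property 2: the core uses no routing segment longer than w.
    short_only = all(length <= w for length in core_segment_lengths)
    return surrounded and short_only
```

In this model, a second core placed inside the ring, or a single hex
line used by the target core when w is 2, is enough to make the check
fail.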

3.2 A Performance/Area Trade-off

On  an  FPGA,  the  delay  of  a  connection  depends  on  the 
number  of  switch  boxes  it  must  pass  through  rather  than  the 
total  length.  Although  large  moats  consume  a  great  deal  of 
chip  area  (because  they  reserve  switch  boxes  without  mak¬ 
ing  use  of  them  to  perform  an  operation),  they  allow  the 
design  tools  to  make  use  of  longer  segments,  which  helps 
with  the  area  and  performance  of  each  individual  core.  On 
the  other  hand,  small  moats  require  less  chip  area  (for  the 
moat  itself),  but  having  to  use  small  segments  negatively 
affects  the  area  and  performance  of  the  cores. 
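
To make the trade-off concrete, the sketch below counts how many
segments (and hence, roughly, how many switch boxes) a straight
connection of a given length needs when only certain segment lengths
are allowed. It is a toy model under stated assumptions: the route is a
straight line, each segment contributes one switch box traversal, and
congestion is ignored. The function name is hypothetical.

```python
from typing import Iterable


def min_switch_hops(distance: int, segment_lengths: Iterable[int]) -> int:
    """Fewest segments that exactly span `distance` CLBs in a straight line."""
    lengths = tuple(segment_lengths)
    unreachable = distance + 1  # more hops than any real route could need
    best = [0] + [unreachable] * distance
    for d in range(1, distance + 1):
        for s in lengths:
            if s <= d:
                best[d] = min(best[d], best[d - s] + 1)
    if best[distance] > distance:
        raise ValueError("distance cannot be spanned with these segments")
    return best[distance]


if __name__ == "__main__":
    for allowed in [(1, 2, 6), (1, 2), (1,)]:
        print(allowed, min_switch_hops(12, allowed))
```

For a connection spanning 12 CLBs, this model needs 2 segments when
length 6 segments are available, 6 segments when only lengths 1 and 2
are allowed, and 12 segments when only length 1 is allowed.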

A  set  of  experiments  is  needed  to  understand  the  trade¬ 
offs  between  the  size  of  the  moats,  the  number  of  cores  that 
can  be  protected  using  moats,  and  the  performance  and  area 
implications  for  moat  protection. 

3.3  The  Effect  of  Constrained  Routing 

We  begin  by  quantifying  the  effect  of  constraining  the 
tools  to  generate  only  configurations  that  do  not  use  any 
routing  segments  longer  than  length  w.  The  width  of  the 
moat  could  be  any  size,  but  the  optimal  sizes  are  dictated  by 
the  length  of  the  routing  segments.  As  mentioned  before, 
FPGAs  utilize  routing  segments  of  different  sizes,  most 
commonly  1,  2,  6  and  long  lines.  If  we  could  eliminate 
the long lines, then we would require a size 6 moat for
protecting a core. By eliminating long lines and hex lines
(segments of length 6), we only need a moat of size 2, and so on.
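
The relationship between the allowed segment lengths and the moat size
can be written down directly. The sketch below is an illustration under
the assumption that the required moat width equals the longest segment
the tools may use, and that a long line (modeled here as infinite
length) rules out a finite moat. The names are hypothetical.

```python
import math
from typing import Iterable

LONG_LINE = math.inf  # stand-in for a segment spanning a full row or column


def required_moat_width(allowed_segment_lengths: Iterable[float]) -> int:
    """Smallest moat width that the longest allowed segment cannot hop."""
    longest = max(allowed_segment_lengths)
    if math.isinf(longest):
        raise ValueError("long lines must be eliminated before using a moat")
    return int(longest)
```

Under these assumptions, allowing lengths 1, 2 and 6 gives a width of 6,
allowing only lengths 1 and 2 gives a width of 2, and allowing only
length 1 gives a width of 1.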

In  order  to  study  the  impact  of  eliminating  certain  long 
length  segments  on  routing  quality,  we  compare  the  routing 
quality  of  the  MCNC  benchmarks  [32]  on  different  segment 
configurations.  We  use  the  Versatile  Placement  and  Rout¬ 
ing  (VPR)  toolkit  developed  by  the  University  of  Toronto 
for  such  experiments.  VPR  provides  mechanisms  for  exam¬ 
ining  trade-offs  between  different  FPGA  architectures  and 
is  popular  within  the  research  community  [3].  Its  capabili¬ 
ties  to  define  detailed  FPGA  routing  resources  include  sup¬ 
port  for  multiple  segment  routing  tracks  and  the  ability  for 
the  user  to  define  the  distribution  of  the  different  segment 
lengths.  It  also  includes  a  realistic  cost  model  which  pro¬ 
vides  a  basis  for  the  measurement  of  the  quality  of  the  rout¬ 
ing  result. 

The  effect  of  the  routing  constraints  on  performance  and 
area  can  vary  across  different  cores.  Therefore,  we  route  the 
20  biggest  applications  from  the  MCNC  benchmark  set  [32] 
(the  de  facto  standard  for  such  experiments)  using  four  dif¬ 
ferent  configurations.  The  baseline  configuration  supports 
segments with length 1, 2, 6 and longlines. The distribution
of these segments on the routing tracks is 8%, 20%,
60% and 12% respectively, which is similar to the Xilinx
Virtex II platform. The other three configurations are derived
from the baseline configuration by eliminating the
segments with longer lengths. In other words, configuration
1-2-6 will have no longlines, configuration 1-2 will support
only segments of length 1 and 2, and configuration 1 will only
support segments of length 1.

After  performing  placement  and  routing,  we  measure  the 
Figure 3. A simple two-core system mapped onto a small FPGA. The zoom-in to the right shows
the wiring complexity at each and every switch-box on the chip. To statically analyze a large FPGA
with tens of cores and millions of logical gates, we need to restrict the degrees of freedom. Static
verification of a large, complex design involving intertwined cores requires reverse engineering,
which is highly impractical because many companies keep the necessary details about their
bitstreams a closely guarded trade secret.


quality  of  the  routing  results  by  collecting  the  area  and  the 
timing  performance  based  on  the  critical  path  of  the  mapped 
application.  To  be  fair,  all  the  routing  tracks  are  config¬ 
ured  using  the  same  tri-state  buffered  switches  with  Wilton 
connection  patterns  [52]  within  the  switch  box.  A  Wilton 
switch  box  provides  a  good  trade-off  between  routability 
and  area,  and  is  commonly  used  in  FPGA  routing  architec¬ 
tures. 

Figures  5  and  6  show  the  experimental  results,  where  we 
provide  the  average  hardware  area  cost  and  critical  path  per¬ 
formance  for  all  the  benchmarks  over  four  configurations. 
The  existence  of  longlines  has  little  impact  on  the  final  qual¬ 
ity  of  the  mapped  circuits.  However,  significant  degradation 
occurs when we eliminate segments of length 2 and 6. Shorter
segments increase the demand for switch boxes, and these
additional switch resources raise the hardware cost. Moreover,
a signal traveling from one pin to another is likely to pass
through more switches, which lengthens the critical path. If we
eliminate hex lines (length 6) and longlines, there is a 14.9%
area increase and an 18.9% increase in critical path delay, on
average. If design performance is limited directly by the cycle
time, the added critical path delay translates directly into
slowdown.
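
As a small worked example of this last point, and assuming the clock period is set exactly by the critical path, an 18.9% longer critical path lowers the achievable clock frequency by about 15.9% rather than 18.9%, because frequency is the reciprocal of the period. The helper name below is hypothetical.

```python
def throughput_loss(delay_increase):
    """Fractional drop in clock frequency when the critical path grows.

    delay_increase is the fractional increase in critical path delay
    (0.189 for the 18.9% average reported in the text). The new
    frequency is 1 / (1 + delay_increase) of the old one.
    """
    return 1.0 - 1.0 / (1.0 + delay_increase)


loss = throughput_loss(0.189)
print(f"clock frequency drops by {loss:.1%}")
```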

3.4  Overall  Area  Impact 

Figure 7. The trade-off between the number of cores, the size of the moat, and the
utilization of the FPGA. An increasing number of cores results in larger total moat area,
which reduces the overall utilization of the FPGA. Larger moat sizes also use more area,
resulting in lower utilization.

Figure 4. We use moats to physically isolate cores for security. In this example, segments can
span either one or two switch boxes, which requires a moat size of two. Since the
delay of a connection on an FPGA depends on the number of switch boxes it must pass through,
restricting the length of segments reduces performance, but the moats can be smaller. Allowing
longer segments improves performance, but the moats must waste more area.

While the results from Figures 5 and 6 show that there
is some area impact from constraining the routing, there is
also a direct area impact in the form of resources required
to  implement  the  actual  moats  themselves.  Assuming  that 
we  have  a  fixed  amount  of  FPGA  real  estate,  we  really  care 
about  how  much  of  that  area  is  used  up  by  a  combination 
of  the  moats  and  the  core  inflation  due  to  restricted  routing. 
We can call this number the effective utilization. Specifically,
the effective utilization is:

$$U_{\mathit{eff}} = \frac{A_{\mathit{AllRoutes}}}{A_{\mathit{RestrictedRoutes}} + A_{\mathit{Moats}}}$$
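
To make the ratio concrete, the sketch below reads the definition as the area a design needs with unrestricted routing, divided by the area it needs with restricted routing plus the area consumed by moats. The design size and the total moat area are hypothetical numbers chosen only for illustration; the 14.9% inflation is the average reported above for the configuration without hex and long lines.

```python
def effective_utilization(area_all_routes, area_restricted_routes, area_moats):
    """U_eff = A_AllRoutes / (A_RestrictedRoutes + A_Moats)."""
    return area_all_routes / (area_restricted_routes + area_moats)


# Hypothetical design: 10,000 CLBs when routing is unrestricted.
area_all = 10_000
# Core inflation due to restricted routing (14.9% average from the text).
area_restricted = area_all * 1.149
# Hypothetical total moat area, in CLBs.
area_moats = 1_500

u_eff = effective_utilization(area_all, area_restricted, area_moats)
print(f"effective utilization: {u_eff:.3f}")
```

With these illustrative inputs, about 77% of the consumed area does useful work; the remainder is split between routing inflation and moats.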

Figure  7  presents  the  trade-offs  between  the  moat  size, 
the  number  of  isolated  cores  on  the  FPGA,  and  the  utiliza¬ 
tion  of  the  FPGA.  The  FPGA  used  for  these  calculations 
was  a  Xilinx  Virtex-4  Device  which  has  192  CLB  rows 
and  116  CLB  columns.  The  figure  examines  three  differ¬ 
ent  moat  sizes:  1,  2  and  6  for  a  variable  number  of  cores  on 
the  chip  (conservatively  assuming  that  a  moat  is  required 
around  all  cores).  As  the  number  of  cores  increases,  the  uti¬ 
lization  of  the  FPGA  decreases  since  the  area  of  the  moats, 
which  is  unusable  space,  increases.  However,  when  a  small 
number  of  cores  is  used,  a  larger  moat  size  is  better  because 
it  allows  us  to  make  more  efficient  use  of  the  non-moat  parts 
of the chip. If only a single core must be isolated (from the
I/O pads), then a moat of width 6 is best (consuming 12%
of the chip resources). However, as the curve labeled “Moat
Size = 2” in Figure 7 shows, a moat width of two has the
optimal effective utilization for designs that have between two
and 120 cores. As a point of reference, a modern FPGA can hold
on the order of 100 stripped-down microprocessor cores. The
number of cores is heavily dependent on the application, and the
trade-off presented here is somewhat specific to our particular
platform, but our analysis method is still applicable to other
designs. In fact, as FPGAs continue to grow according to Moore’s
Law, the percent overhead for moats should continue to drop.
Because the moats are perimeters, as the size of a core grows
by a factor of n, the cost of the moat only grows by $O(\sqrt{n})$.
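
The perimeter argument can be checked with a short calculation. The sketch assumes a square core surrounded on all four sides by a moat of fixed width, which is a simplification of real floorplans; the function name is hypothetical.

```python
import math


def moat_overhead_fraction(core_area_clbs, moat_width):
    """Moat area divided by core area for a square core with a full moat."""
    side = math.sqrt(core_area_clbs)
    moat_area = (side + 2 * moat_width) ** 2 - core_area_clbs
    return moat_area / core_area_clbs


for area in (100, 10_000):
    fraction = moat_overhead_fraction(area, moat_width=2)
    print(f"core of {area} CLBs: moat overhead {fraction:.2%}")
```

Growing the core by a factor of 100 grows the moat by less than a factor of 10 here (96 CLBs versus 816 CLBs), so the relative overhead falls from 96% to about 8%.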

3.5  Effective  Scrubbing  and  Reuse  of  Reconfigurable  Hardware

Moats allow us to reason about isolation without any
knowledge of the inner workings of cores, which are far too
complex for us to determine feasibly whether a particular element
of one core is connected to another core. Furthermore, moats
also  allow  us  to  isolate  cores  designed  with  a  less  trustwor¬ 
thy  tool  chain  from  cores  that  are  the  result  of  a  more  trust¬ 
worthy  tool  chain.  While  these  are  both  useful  properties, 
we  need  to  make  sure  we  can  actually  implement  them.  In 
fact,  a  few  of  the  latest  FPGAs  available  have  the  ability  to 
change  a  selective  part  of  their  configuration,  one  column  at 
a  time  [34].  A  specialized  core  on  the  FPGA  can  read  one 
frame  of  the  configuration,  change  part  of  this  frame,  and 
write  the  modified  frame  back.  This  core  must  therefore  be 
part  of  the  trusted  computing  base  of  the  system. 

Partial  reconfiguration  improves  the  flexibility  of  a  sys¬ 
tem  by  making  it  possible  to  swap  cores.  If  the  number  of 
possible  configurations  is  small,  then  static  verification  is 
sufficient,  but  if  the  space  of  possible  cores  is  infinite,  then 
dynamic  verification  is  necessary.  For  example,  Baker  et 
al.  have  developed  an  intrusion  detection  system  based  on 
reconfigurable  hardware  that  dynamically  swaps  the  detec¬ 
tion  cores  [2]  [1].  Since  the  space  of  intrusion  detection 
rule  sets  is  infinite,  the  space  of  detection  cores  is  also  in¬ 
finite.  Huffmire  et  al.  have  developed  a  memory  protec¬ 
tion  scheme  for  reconfigurable  hardware  in  which  a  recon¬ 
figurable  reference  monitor  enforces  a  policy  that  specifies 
the legal sharing of memory [19]. Partial reconfiguration
could allow the system to change the policy being enforced
by swapping in a different reference monitor. Since the
space of possible policies is infinite, the space of possible
reference monitors is also infinite. Lysaght and Levi have
devised a dynamically reconfigurable crossbar switch [33].
By using dynamic reconfiguration, their 928x928 crossbar
uses 4,836 CLBs compared to the 53,824 CLBs required
without reconfiguration.

Figure 5. Comparison of area for different configurations of routing segments. The
baseline system has segments with length 1, 2, 6 and longline. The distribution is close
to that of Virtex II: 8% (1), 20% (2), 60% (6) and 12% (longline). Other configurations are
created by eliminating one or more classes of segments. For example, configuration 1-2-6
removes the longlines and distributes them proportionally to other types of segments.

Figure 6. Comparison of critical path timing for different configurations of routing
segments. Unlike Figure 7, the graphs in Figures 5 and 6 do not include the overhead of the
moat itself. The error bars show one standard deviation.

To extend our model of moats to this more dynamic case,
we need to make sure not only that our static analysis is
simple enough to be performed on-line by a simple embedded
core (which we argue it is), but also that nothing remains of
the prior core’s logic when it is replaced with a different
core. In this section, we describe how we can enable object
reuse through configuration cleansing.

By  rewriting  a  selective  portion  of  the  configuration  bits 
for  a  certain  core,  we  can  erase  any  information  it  has  stored 
in  memory  or  registers.  The  ICAP  (Internal  Configuration 
Access  Port)  on  Xilinx  devices  allows  us  to  read,  modify, 
and  write  back  the  configuration  bitstream  on  Virtex  II  de¬ 
vices.  The  ICAP  can  be  controlled  by  a  Microblaze  soft 
core  processor  or  an  embedded  PowerPC  processor  if  the 
chip  has  one.  The  ICAP  has  an  8-bit  data  port  and  typi¬ 
cally  runs  at  a  clock  speed  of  50  MHz.  Configuration  data 
is read and written one frame at a time. A frame spans the
entire height of the device, and frame size varies based on
the device.

Table 1 gives some information on the size and number
of frames across several Xilinx Virtex II devices. The
smallest device has 404 frames, and each frame requires 5.04 us
to reconfigure, or equivalently, erase. Therefore,
reconfiguring (erasing) the entire device takes around 2 ms.
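
The 2 ms figure follows directly from Table 1, and the same arithmetic extends to the larger devices. The sketch below also includes a timing model that is our own inference rather than a documented formula: the per-frame times in Table 1 are reproduced exactly by 8 ICAP cycles per 32-bit word plus a constant 44 cycles at 50 MHz. Eight cycles per word is what one would expect if each word crosses the 8-bit port once for the read and once for the write; the 44-cycle constant is simply what the fit leaves over.

```python
ICAP_CLOCK_HZ = 50_000_000

# Rows of Table 1: (number of frames, 32-bit words per frame, R/W time in us).
TABLE_1 = [
    (404, 26, 5.04),
    (928, 86, 14.64),
    (1456, 146, 24.24),
    (2860, 286, 46.64),
]


def modeled_frame_rw_us(words, cycles_per_word=8, overhead_cycles=44):
    """Inferred linear model of the read/write time for one frame, in us."""
    cycles = cycles_per_word * words + overhead_cycles
    return cycles / ICAP_CLOCK_HZ * 1e6


def full_device_time_ms(frames, frame_us):
    """Time to rewrite every frame of the device once, in ms."""
    return frames * frame_us / 1000.0


for frames, words, frame_us in TABLE_1:
    total_ms = full_device_time_ms(frames, frame_us)
    model_us = modeled_frame_rw_us(words)
    print(f"{frames:5d} frames: {total_ms:8.2f} ms total, model {model_us:.2f} us/frame")
```

For the largest device in the table, the same calculation gives roughly 133 ms for a full erase.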

To  sanitize  a  core  we  must  perform  3  steps.  First  we  must 
read  in  a  configuration  frame.  The  second  step  is  to  modify 
the  configuration  frame  so  that  the  flip-flops  and  memory 
are  erased.  The  last  step  is  to  write  back  the  modified  con¬ 
figuration  frame.  The  number  of  frames  and  how  much  of 
the  frame  we  must  modify  depend  on  the  size  of  the  core 
that  is  being  sanitized.  This  process  must  be  repeated  since 
each  core  will  span  the  width  of  many  frames.  In  general, 
the  size  of  the  core  is  linearly  related  to  the  time  that  is 
needed  to  sanitize  it. 
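
The three steps can be sketched against a simulated configuration memory. This is a minimal model under stated assumptions: frames are plain byte strings, the core occupies a contiguous range of frames and a contiguous byte range within each frame, and writing zeros stands in for whatever bit pattern actually clears flip-flops and memory on a given device. None of the names below correspond to a real ICAP driver API.

```python
def sanitize_core(config_memory, frame_range, byte_range, blank=0x00):
    """Read-modify-write every frame covered by a core.

    config_memory: list of bytes objects, one per configuration frame.
    frame_range:   (first, last) frame indices spanned by the core, inclusive.
    byte_range:    (lo, hi) slice of each frame that belongs to the core.
    Returns the number of frames rewritten.
    """
    first_frame, last_frame = frame_range
    lo, hi = byte_range
    for index in range(first_frame, last_frame + 1):
        frame = bytearray(config_memory[index])    # step 1: read the frame
        frame[lo:hi] = bytes([blank]) * (hi - lo)  # step 2: erase the core's part
        config_memory[index] = bytes(frame)        # step 3: write the frame back
    return last_frame - first_frame + 1


# Ten simulated frames of 104 bytes (26 words of 32 bits), all bits set.
memory = [bytes([0xFF]) * 104 for _ in range(10)]
rewritten = sanitize_core(memory, frame_range=(2, 4), byte_range=(10, 20))
print(f"rewrote {rewritten} frames")
```

Because a frame spans the full height of the device, only the slice belonging to the core is modified; the rest of each frame is written back unchanged. The loop runs once per frame, which is why sanitization time grows linearly with core size.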

Our  object  reuse  technique  can  also  disable  a  core  if  ex¬ 
treme  circumstances  should  require  it,  such  as  tampering. 
Embedded  devices  such  as  cell  phones  are  very  difficult  to 
sanitize  [38].  Smart  phones  contain  valuable  personal  data, 
and  the  theft  or  loss  of  a  phone  can  result  in  serious  conse¬ 
quences  such  as  identity  theft.  Embedded  devices  used  by 
the  military  may  contain  vital  secrets  that  must  never  fall 
into  enemy  hands.  Furthermore,  valuable  IP  information  of 
the  cores  is  stored  in  the  form  of  the  bitstream  on  the  FPGA. 
A  method  of  disabling  all  or  part  of  the  device  is  needed  to 
protect important information stored on the FPGA in the
extreme case of physical tampering.

Table 1. Reconfiguration Time

Device     # Frames   Frame Length (32-bit words)   R/W time for 1 frame (ICAP @ 50 MHz)
XC2V40     404        26                            5.04 us
XC2V500    928        86                            14.64 us
XC2V2000   1456       146                           24.24 us
XC2V8000   2860       286                           46.64 us

The  IBM  4758  is  an  example  of  a  cryptographic  copro¬ 
cessor  that  has  been  designed  to  detect  tampering  and  to 
disable  itself  whenever  tampering  occurs  [51].  The  device  is 
surrounded  by  specialized  packaging  containing  wire  mesh. 
Any  tampering  of  the  device  disturbs  this  mesh,  and  the  de¬ 
vice  can  respond  by  disabling  itself. 

4  Drawbridges:  Interconnect  Interface  Conformance  with  Tracing

In  the  previous  section,  we  described  an  effective  method 
for  isolating  cores  using  moats.  Our  moat  methodology 
eliminates  the  possibility  for  external  cores  to  tap  into  the 
information  contained  in  a  core  surrounded  by  the  moat. 
However,  cores  do  not  work  in  isolation  and  must  commu¬ 
nicate  with  other  cores  to  receive  and  send  data.  Therefore, 
we must allow controlled entry into our core. Entry, or
communication, is allowed only via prespecified transactions
through a “drawbridge”. We must know in advance which cores
we need to communicate with and the location of those cores
on the FPGA. Often, it is most efficient to communicate with
multiple cores through a shared interconnection (i.e., a bus).
Again, we must ensure that bus communications are received
only by the intended recipients. Therefore, we require methods
to ensure that 1) communication is established only with the
specified cores and 2) communication over a shared medium does
not result in a covert channel. In this section, we present two
techniques, interconnect tracing and a bus arbiter, to handle
these two requirements.

We  have  developed  an  interconnect  tracing  technique  for 
preventing  unintended  flows  of  information  on  an  FPGA. 
Our  method  allows  a  designer  to  specify  the  connections  on 
a  chip,  and  a  static  analysis  tool  checks  that  each  connec¬ 
tion  only  connects  the  specified  components  and  does  not 
connect  with  anything  else.  This  interconnect  tracing  tool 
takes  a  bitstream  file  and  a  text  file  that  defines  the  modules 
and interconnects in a simple language which we have
developed. The big advantage of our tool is that it allows us to
perform the tracing on the bitstream file. We do not require
a higher level description of the design of the core.
Performing this analysis during the last stage of design allows
us to catch illegal connections that could have originated
from any stage in the design process, including the design
tools themselves.

In  order  for  the  tracing  to  work  we  must  know  the  loca¬ 
tions  of  the  modules  on  the  chip  and  the  valid  connections 
to/from  the  modules.  To  accomplish  this  we  place  moats 
around  the  cores  during  the  design  phase.  We  now  know  the 
location  of  the  cores  and  the  moats,  and  we  use  this  infor¬ 
mation  to  specify  a  text  file  that  defines:  all  the  cores  along 
with  their  location  on  the  chip,  all  I/O  pins  used  in  the  de¬ 
sign,  and  a  list  of  valid  connections.  Then  our  tool  uses  the 
JBits  API  [13]  to  analyze  the  bitstream  and  check  to  make 
sure  there  are  no  invalid  connections  in  the  design.  The  pro¬ 
cess  of  interconnect  tracing  is  performed  by  analyzing  the 
bitstream  to  determine  the  status  of  the  switchboxes.  We 
can  use  this  technique  to  trace  the  path  that  a  connection  is 
routed  along  and  ensure  that  it  goes  where  it  is  supposed  to. 
This  tracing  technique  allows  us  to  ensure  that  the  different 
cores  can  only  communicate  through  the  channels  we  have 
specified  and  that  no  physical  trap  doors  have  been  added 
anywhere  in  the  design. 
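
The core of the tracing check can be sketched as a reachability search. The sketch assumes the active switchbox connections have already been decoded from the bitstream into a list of wire-to-wire edges; it does not use the JBits API, and the node names, the specification format, and all function names are hypothetical.

```python
from collections import deque


def build_net(edges):
    """Undirected adjacency map from a list of active (wire, wire) connections."""
    net = {}
    for a, b in edges:
        net.setdefault(a, set()).add(b)
        net.setdefault(b, set()).add(a)
    return net


def trace(net, start):
    """All wires and pins electrically reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in net.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def find_illegal_connections(net, pin_owner, allowed):
    """Pairs of cores that are connected but not listed as a valid connection."""
    illegal = set()
    for pin, owner in pin_owner.items():
        for reached in trace(net, pin):
            other = pin_owner.get(reached)
            if other is not None and other != owner:
                pair = frozenset((owner, other))
                if pair not in allowed:
                    illegal.add(pair)
    return illegal


pin_owner = {"A.out": "A", "B.in": "B", "C.out": "C"}
allowed = {frozenset(("A", "B"))}
good = build_net([("A.out", "sw1"), ("sw1", "B.in"), ("C.out", "sw2")])
bad = build_net([("A.out", "sw1"), ("sw1", "B.in"), ("C.out", "sw2"), ("sw2", "sw1")])
print(find_illegal_connections(good, pin_owner, allowed))
print(find_illegal_connections(bad, pin_owner, allowed))
```

In the second design a single extra switch setting joins core C to the A-B connection, and the check reports both the A-C and B-C pairs as illegal.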

Ensuring that interconnects between modules are secure
is necessary for developing a secure architecture. This
problem is made more complicated by the abundance of routing
resources on an FPGA and the ease with which they can be
reconfigured. Our proposed interconnect tracing technique
allows us to ensure the integrity of connections on a
reconfigurable device. This tool gives us the ability to perform
checking in the final design stage, right before the bitstream
is loaded onto the device.

4.1  Efficient  Communication  under  the 
Drawbridge  Model 

In  modern  reconfigurable  systems,  cores  communicate 
with  each  other  via  a  shared  bus.  Unfortunately,  the  shared 
nature  of  a  traditional  bus  architecture  raises  several  secu¬ 
rity  issues.  Malicious  cores  can  obtain  secrets  by  snooping 
on  the  bus.  In  addition,  the  bus  can  be  used  as  a  covert  chan¬ 
nel  to  leak  secret  data  from  one  core  to  another.  The  ease 
of  reconfigurability  on  FPGAs  allows  us  to  address  these 
problems  at  the  hardware  level. 

Figure 8. Architecture alternative 1. There is a single arbiter and each module has a
dedicated connection to the arbiter.

Figure 9. Architecture alternative 2. Each module has its own arbiter that prevents bus
snooping and a central time multiplexer that connects to all the arbiters.

To address this problem of covert channels and bus
snooping, we have developed a shared memory bus with
time-division access. The bus divides the time equally
among  the  modules,  and  each  module  can  read/write  one 
word  to/from  the  shared  memory  during  its  assigned  time 
slice. Our approach of arbitrating by time division eliminates
covert channels. With traditional bus arbitration, a
bus-contention covert channel can exist in any shared bus system
where multiple cores or modules access a shared memory. Via this
covert channel, a malicious core can modulate its bus
references, altering the latency of bus references for other
modules. This enables the transfer of information between any
two modules that can access the bus [18]. This covert channel
could be used to send information from a module with a high
security clearance to a module with lower security clearance
(write-down), which would violate a Bell-LaPadula multilevel
policy and cannot be prevented through the use of the reference
monitor. To eliminate this covert channel, we give each module
an equal share of time to use the bus, eliminating the transfer
of information by modulating bus contention. Since each module
can only use the bus during its allotted time slice, it has no
way of changing the bus contention. One module cannot even tell
if any of the other modules are using the bus. While this does
limit performance of the bus, it removes the covert channel. The
only other feasible way that we see to remove this covert
channel is to give each module a dedicated connection to all
other modules. Requiring a dedicated direct connection between
each set of modules that need to communicate would be
inefficient and costly. Dedicated channels would require a worst
case of $O(2^n)$ connections, where n is the number of modules
in the design. Our architecture requires only O(n) connections.
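
The reason fixed time slices close the contention channel can be shown with a small cycle-level simulation. This is a sketch under simplifying assumptions (one word per slot, one slot per cycle, round-robin slot order), not a model of the VHDL prototype; all names are hypothetical.

```python
from collections import deque


def simulate(requests, num_modules, total_cycles):
    """Serve one-word requests on a time-division bus.

    requests: dict mapping module id to the cycles at which it issues requests.
    Returns a dict mapping module id to the cycles at which requests are served.
    """
    pending = {m: deque(sorted(requests.get(m, []))) for m in range(num_modules)}
    served = {m: [] for m in range(num_modules)}
    for cycle in range(total_cycles):
        owner = cycle % num_modules  # the slot owner depends only on the clock
        queue = pending[owner]
        if queue and queue[0] <= cycle:
            queue.popleft()
            served[owner].append(cycle)
        # An unused slot stays idle; it is never handed to another module.
    return served


victim_requests = [1, 5]
quiet = simulate({0: victim_requests}, num_modules=4, total_cycles=32)
busy = simulate(
    {0: victim_requests, 1: list(range(32)), 2: list(range(32)), 3: list(range(32))},
    num_modules=4,
    total_cycles=32,
)
print(quiet[0], busy[0])
```

Module 0 sees its requests served at the same cycles whether the other three modules are idle or saturating the bus, so their activity cannot be observed through latency. The cost is that idle slots are wasted, which is the performance limitation noted above.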

Bus  snooping  is  another  major  concern  associated  with  a 
shared  bus.  Even  if  we  eliminate  the  covert  channels  there 
is nothing to prevent bus snooping. For example, let us
consider a system where we want to send data from one classified
module to another and where there are unclassified modules
on the same bus. We need a way to ensure that these less
trusted modules cannot obtain this information by snooping
the bus. To solve this problem, we place an arbiter between
the module and the memory. The arbiter only allows each
module to read during its time share of the bus. In addition,
a memory monitor is required, but for this work we assume
that such a configuration can be implemented on the FPGA
using the results of Huffmire et al. [19].

4.2  Architecture  Alternatives 

We  devised  two  similar  architectures  to  prevent  snoop¬ 
ing  and  to  eliminate  covert  channels  on  the  bus.  In  our  first 
architecture,  each  module  has  its  own  separate  connection 
to  a  single  arbiter,  which  sits  between  the  shared  memory 
and the modules. This arbiter schedules access to the memory
equally according to a time-division schedule (Figure 8). A
module is only allowed to read or write during its allotted
time, and when a module reads, the data is only sent to
the module that issued the read request. The second
architecture is more like a traditional bus. In this design, there
is  an  individual  arbiter  that  sits  between  each  module  and 
the  bus.  These  arbiters  are  all  connected  to  a  central  timing 
module  which  handles  the  scheduling  (Figure  9).  The  in¬ 
dividual  arbiters  work  in  the  same  way  as  the  single  arbiter 
in  the  first  architecture  to  prevent  snooping  and  to  remove 
covert  channels.  To  make  interfacing  easy,  both  of  these 
architectures  have  a  simple  interface  so  that  a  module  can 
easily  read/write  to  the  shared  memory  without  having  to 
worry  about  the  timing  of  the  bus  arbiter. 
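
A behavioral sketch of the first architecture is given below. It assumes one pending request per module, one memory word per time slot, and round-robin slot order; the memory monitor is omitted. The class and method names are hypothetical and do not reflect the interface of the VHDL prototype.

```python
class Arbiter:
    """Single arbiter with a dedicated port per module (architecture 1)."""

    def __init__(self, num_modules, memory_words):
        self.num_modules = num_modules
        self.memory = [0] * memory_words
        self.requests = [None] * num_modules   # one pending request per port
        self.responses = [None] * num_modules  # read data, visible to one port only
        self.cycle = 0

    def submit(self, module_id, op, addr, value=None):
        """A module posts a request at any time; the arbiter handles the timing."""
        self.requests[module_id] = (op, addr, value)

    def tick(self):
        """Advance one time slot and serve only the module that owns it."""
        owner = self.cycle % self.num_modules
        request = self.requests[owner]
        if request is not None:
            op, addr, value = request
            if op == "write":
                self.memory[addr] = value
            else:
                self.responses[owner] = self.memory[addr]
            self.requests[owner] = None
        self.cycle += 1

    def collect(self, module_id):
        """Return and clear the read data waiting on this module's port."""
        data = self.responses[module_id]
        self.responses[module_id] = None
        return data


arbiter = Arbiter(num_modules=3, memory_words=16)
arbiter.submit(1, "write", 5, 42)
arbiter.submit(2, "read", 5)
for _ in range(3):
    arbiter.tick()
print(arbiter.collect(2), arbiter.collect(0))
```

Read data is placed only on the port of the module that issued the request, so the other modules have nothing to snoop, and a module never needs to know which slot it owns.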

During the design process, we found that the first
architecture seemed easier to implement, but we anticipated that
the second architecture would be more efficient. In our first
architecture (Figure 8), everything is centralized, making a
centralized memory monitor and arbiter much easier to design
and verify. In addition, a single moat could be used to isolate
this functionality. Our second architecture (Figure 9)
intuitively should be more scalable and efficient since it uses
a bus instead of individual connections for each module, but
the arbiters have to coordinate, the memory monitor has to be
split (if that is even possible), and each arbiter needs to be
protected by its own moat.

To test our hypotheses, we developed prototypes of both
of the architectures. The prototypes were developed in
VHDL and synthesized for a Xilinx Virtex-II device in order
to determine the area and performance of the designs
on a typical FPGA. We did not account for the extra moat
or monitor overhead. Even under this assumption, the results
of the analysis of the two architectures, which can be seen in
Table 2, were not what we first expected. During synthesis of
the second architecture, the synthesis tool converted the
tri-state buffers⁶ in the bus to digital logic. As a result, the
second architecture used more area than the first and only had
a negligible performance advantage. Contrary to what we
expected, the first architecture used roughly 15% less area
on the FPGA and is simpler to implement and verify. Since
the performance difference between the two was almost
negligible, the first architecture is the better design choice.
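
The percent differences in Table 2 can be reproduced by expressing the change from Architecture 1 to Architecture 2 relative to Architecture 1. That choice of baseline is our reading of the table; the function name is hypothetical.

```python
def percent_difference(arch1, arch2):
    """Change from architecture 1 to architecture 2, relative to architecture 1."""
    return 100.0 * (arch2 - arch1) / arch1


# Rows of Table 2: (metric, architecture 1, architecture 2).
TABLE_2 = [
    ("Slices", 146, 169),
    ("Flip Flops", 177, 206),
    ("4 Input LUTs", 253, 305),
    ("Maximum Clock Frequency", 270.93, 271.297),
]

for metric, arch1, arch2 in TABLE_2:
    print(f"{metric}: {percent_difference(arch1, arch2):.2f}")
```

Measured this way, Architecture 2 uses 15.75% more slices than Architecture 1. Taking Architecture 2 as the baseline instead, Architecture 1 uses about 13.6% fewer slices.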

This  bus  architecture  allows  modules  to  communicate  se¬ 
curely  with  a  shared  memory  and  prevents  bus  snooping  and 
certain  covert  channels.  When  combined  with  the  reference 
monitor  this  secure  bus  architecture  provides  a  secure  and 
efficient  way  for  modules  to  communicate. 

5  Application:  Memory  Policy  Enforcement 

Now  that  we  have  described  isolation  and  its  related 
primitives,  we  provide  an  example  of  the  application  of  iso¬ 
lation  to  memory  protection,  an  even  higher-level  primitive. 
Saltzer  and  Schroeder  identify  three  key  elements  that  are 
necessary  for  protection:  “Conceptually,  then,  it  is  neces¬ 
sary  to  build  an  impenetrable  wall  around  each  distinct  ob¬ 
ject  that  warrants  separate  protection,  construct  a  door  in 
the  wall  through  which  access  can  be  obtained,  and  post  a 
guard at the door to control its use.” [43]. In addition, the
guard  must  be  able  to  identify  the  authorized  users.  In  the 
case  of  protecting  cores,  our  moat  primitive  is  analogous  to 
the  wall,  and  our  drawbridge  primitive  is  analogous  to  the 
door.  Our  interconnect  tracing  and  secure  bus  primitives  act 
as  the  guard. 

One  way  of  protecting  memory  in  an  FPGA  system  is 
to  use  a  reference  monitor  that  is  loaded  onto  the  FPGA 
along  with  the  other  cores  [19].  Here,  the  reference  monitor 
is  analogous  to  the  guard  because  it  decides  the  legality  of 
every  memory  access  according  to  a  policy.  This  requires 
that every access go through the reference monitor. Without
our isolation primitive, it is easy for a core to bypass the
reference monitor and access memory directly. Since moats
completely surround a core except for a small amount of
logic (the drawbridge) for communicating with the rest of
the chip, it is much easier to prevent a core from bypassing
the reference monitor.

⁶ Tri-state buffers are gates that can output either a 0, 1, or Z, a high-impedance
state in which the gate acts as if it were disconnected from the wire.

Saltzer  and  Schroeder  describe  how  protection  mecha¬ 
nisms  can  protect  their  own  implementations  in  addition  to 
protecting  users  from  each  other  [43].  Protecting  the  ref¬ 
erence  monitor  from  attack  is  critical  to  the  security  of  the 
system, but the fact that the reference monitor itself is
reconfigurable makes it vulnerable to attack by the other cores
on  the  chip.  However,  moats  can  mitigate  this  problem  by 
providing  physical  isolation  of  the  reference  monitor. 

Our  isolation  primitive  also  makes  it  harder  for  an  unau¬ 
thorized  information  flow  from  one  core  to  another  to  oc¬ 
cur.  Establishing  a  direct  connection  between  the  two  cores 
would  clearly  thwart  the  reference  monitor.  If  moats  sur¬ 
round  each  core,  it  is  much  harder  to  connect  two  cores  di¬ 
rectly  without  crossing  the  moat. 

As  we  described  above,  a  reference  monitor  approach 
to  memory  protection  requires  that  every  memory  access 
go  through  the  reference  monitor.  However,  cores  are  con¬ 
nected  to  each  other  and  to  main  memory  by  means  of  a 
shared  bus.  As  we  explained  in  Section  4.1,  the  data  on  a 
shared  bus  is  visible  to  all  cores.  Our  secure  bus  primitive 
protects  the  data  flowing  on  the  bus  by  controlling  the  shar¬ 
ing  of  the  bus  with  a  fixed  time  division  approach. 
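
The guard role of the reference monitor can be illustrated with a default-deny lookup over address ranges. This is only a software sketch of the idea: the scheme of Huffmire et al. [19] compiles a policy written in a specialized language into a circuit, and the rule format, module names, and addresses below are hypothetical.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    module: str
    lo: int          # first address covered by the rule
    hi: int          # last address covered by the rule (inclusive)
    ops: frozenset   # subset of {"read", "write"}


class ReferenceMonitor:
    """Decides the legality of each memory access; anything not listed is denied."""

    def __init__(self, rules):
        self.rules = list(rules)

    def allows(self, module, op, addr):
        return any(
            rule.module == module and rule.lo <= addr <= rule.hi and op in rule.ops
            for rule in self.rules
        )


monitor = ReferenceMonitor([
    Rule("crypto_core", 0x0000, 0x0FFF, frozenset({"read", "write"})),
    Rule("network_core", 0x1000, 0x1FFF, frozenset({"read", "write"})),
    Rule("network_core", 0x0000, 0x00FF, frozenset({"read"})),
])

print(monitor.allows("network_core", "read", 0x0010))
print(monitor.allows("network_core", "write", 0x0010))
```

A check like this is only meaningful if every access actually reaches it, which is what the moat, drawbridge, and secure bus primitives are meant to guarantee.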

A  memory  protection  system  that  allows  dynamic  pol¬ 
icy  changes  requires  an  object  reuse  primitive.  It  is  often 
useful  for  a  system  to  be  able  to  respond  to  external  events. 
For  example,  during  a  fire,  all  doors  in  a  building  should 
be  unlocked  without  exception  (a  more  permissive  policy 
than  normal),  and  all  elevators  should  be  disabled  (a  less 
permissive  policy  than  normal).  In  the  case  of  an  embedded 
device,  a  system  under  attack  may  wish  to  change  the  policy 
enforced  by  its  reference  monitor.  There  are  several  ways  to 
change policies. One way is to overwrite the reference
monitor with a completely different one. Our scrubbing
primitive can ensure that no remnants of the earlier reference
monitor  remain.  Since  cores  may  retain  some  information 
in  their  local  memory  following  a  policy  change,  our  scrub¬ 
bing  primitive  can  also  be  used  to  cleanse  the  cores. 

6  Related  Work 

There  has  always  been  an  important  relationship  be¬ 
tween  the  hardware  a  system  runs  on  and  the  security  of  that 
system.  Reconfigurable  systems  are  no  different,  although 
to  the  best  of  our  knowledge  we  are  the  first  to  address  the 
problem  of  isolation  and  physical  interface  conformance  on 
them. However, in addition to the related work we have
already mentioned, we do build on the results of prior related
efforts. In particular, we build on the ideas of reconfigurable
security, IP protection, secure update, covert channels,
direct channels, and trap doors. While a full description of all
prior work in these areas is not possible, we highlight some
of the most related.

Table 2. Comparison of Communication Architectures

                          Architecture 1   Architecture 2   Percent Difference
Slices                    146              169              15.75
Flip Flops                177              206              16.38
4 Input LUTs              253              305              20.55
Maximum Clock Frequency   270.93           271.297          0.14

6.1  Reconfigurable  Hardware  Security 

The closest work to ours is that of Huffmire et al.
To  provide  memory  protection  on  an  FPGA,  Huffmire  et 
al.  propose  the  use  of  a  reconfigurable  reference  monitor 
that  enforces  the  legal  sharing  of  memory  among  cores  [19]. 
A  memory  access  policy  is  expressed  in  a  specialized  lan¬ 
guage,  and  a  compiler  translates  this  policy  directly  to  a  cir¬ 
cuit  that  enforces  the  policy.  The  circuit  is  then  loaded  onto 
the  FPGA  along  with  the  cores.  While  their  work  addresses 
the  specifics  of  how  to  construct  a  memory  access  moni¬ 
tor  efficiently  in  reconfigurable  hardware,  they  do  not  ad¬ 
dress  the  problem  of  how  to  protect  that  monitor  from  rout¬ 
ing  interference,  nor  do  they  describe  how  to  enforce  that 
all  memory  accesses  go  through  this  monitor.  This  paper 
directly  supports  their  work  by  providing  the  fundamental 
primitives  that  are  needed  to  implement  memory  protection 
on  a  reconfigurable  device. 

There  appears  to  be  little  other  work  on  the  specifics 
of  managing  FPGA  resources  in  a  secure  manner.  Chien 
and  Byun  have  perhaps  the  closest  work,  where  they  ad¬ 
dressed  the  safety  and  protection  concerns  of  enhancing 
a  CMOS  processor  with  reconfigurable  logic  [8].  Their 
design  achieves  process  isolation  by  providing  a  reconfig¬ 
urable  virtual  machine  to  each  process,  and  their  architec¬ 
ture  uses  hardwired  TLBs  to  check  all  memory  accesses. 
Our  work  could  be  used  in  conjunction  with  theirs,  using 
soft-processor cores on top of commercial off-the-shelf
FPGAs rather than a custom silicon platform. In fact, we
believe one of the strong points of our work is that it may
provide  a  viable  implementation  path  to  those  that  require  a 
custom  secure  architecture,  for  example  execute-only  mem¬ 
ory  [31]  or  virtual  secure  co-processing  [29]. 

Gogniat et al. propose a method of embedded system design that
implements security primitives such as AES encryption on an FPGA,
which is one component of a secure embedded system containing memory,
I/O, CPU, and other ASIC components [12]. Their Security Primitive
Controller (SPC), which is separate from the FPGA, can dynamically
modify these primitives at runtime in response to the detection of
abnormal activity (attacks). In this work, the reconfigurable nature
of the FPGA is used to adapt a crypto core to situational concerns,
although the focus is on how to use an FPGA to help efficiently thwart
system-level attacks rather than on chip-level concerns. Indeed, FPGAs
are a natural platform for performing many cryptographic functions
because of the large number of bit-level operations that are required
in modern block ciphers. However, while there is a great deal of work
centered around exploiting FPGAs to speed up cryptographic or
intrusion detection primitives, systems researchers are only now
starting to realize the security ramifications of building systems
around reconfigurable hardware.

Most of the work relating to FPGA security has been targeted at the
problem of preventing the theft of intellectual property and securely
uploading bitstreams in the field. Because such attacks directly
impact their bottom line, industry has already developed several
techniques to combat the theft of FPGA IP, such as encryption [6] [23]
[24], fingerprinting [27], and watermarking [28]. However,
establishing a root of trust on a fielded device is challenging
because it requires a decryption key to be incorporated into the
finished product. Some FPGAs can be remotely updated in the field, and
industry has devised secure hardware update channels that use
authentication mechanisms to prevent a subverted bitstream from being
uploaded [16] [15]. These techniques were developed to prevent an
attacker from uploading a malicious design that causes unintended
functionality. Even worse, a malicious design could physically
destroy the FPGA by causing the device to short-circuit [14]. However,
these authentication techniques merely ensure that a bitstream is
authentic. An “authentic” bitstream could still contain a subverted
core that was designed by a third party.

6.2  Covert  Channels,  Direct  Channels,  and  Trap  Doors

The work in Section 4.1 draws directly upon the existing work on
covert channels. Exploitation of a covert channel results in the
unintended flow of information between cores. Covert channels work via
an internal shared resource, such as power consumption, processor
activity, disk usage, or error conditions [47] [41]. Classical covert
channel analysis involves the articulation of all shared resources on
chip, identifying the share points, determining if the shared resource
is exploitable, determining the bandwidth of the covert channel, and
determining whether remedial action can be taken [25]. Storage
channels can be mitigated by partitioning the resources, while timing
channels can be mitigated with sequential access, a fact we exploit in
the construction of our bus architecture. Examples of remedial action
include decreasing the bandwidth (e.g., the introduction of artificial
spikes (noise) in resource usage [44]) or closing the channel.
Unfortunately, an adversary can extract a signal from the noise, given
sufficient resources [37].
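One way to read "sequential access" is a fixed round-robin schedule in which each core owns a bus slot regardless of what the other cores request. The sketch below illustrates that reading only; the function names and the three-core example are assumptions, not the bus architecture as implemented in this paper.

```python
def slot_owner(cycle, cores):
    """Return the core that owns the bus slot at the given cycle."""
    return cores[cycle % len(cores)]


def grant(cycle, cores, requests):
    """Grant the bus to the slot owner only if it is requesting.

    An unused slot stays idle instead of being handed to another core,
    so the cycle at which a core gets the bus never depends on the
    activity of the other cores.
    """
    owner = slot_owner(cycle, cores)
    return owner if owner in requests else None
```

For example, with `cores = ["a", "b", "c"]`, core `b` is served at cycles 1, 4, 7, and so on, whether or not `a` and `c` are busy. The cost of this design choice is wasted bandwidth whenever a slot owner has nothing to send.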

Of course, our technique is primarily about restricting the
opportunity for direct channels and trap doors [49]. Our memory
protection scheme is an example of this. Without any memory
protection, a core can leak secret data by writing the data directly
to memory. Another example of a direct channel is a tap that connects
two cores. An unintentional tap is a direct channel that is
established by accident rather than by design. For example, the
place-and-route tool’s optimization strategy may interleave the wires
of two cores.
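As a rough illustration of how interleaved wiring might be flagged, the sketch below checks whether any routing point used by one core lies inside the rectangular region assigned to another core. The grid model, the region tuples, and the `find_intrusions` function are hypothetical simplifications and do not reflect any vendor tool or the exact tracing procedure used in this paper.

```python
def inside(region, point):
    """True if point (x, y) lies within region (x0, y0, x1, y1), inclusive."""
    x0, y0, x1, y1 = region
    x, y = point
    return x0 <= x <= x1 and y0 <= y <= y1


def find_intrusions(regions, wires):
    """List (core, point, other_core) for each wire point of `core`
    that falls inside the region assigned to `other_core`."""
    hits = []
    for core, points in wires.items():
        for other, region in regions.items():
            if other == core:
                continue
            for point in sorted(points):
                if inside(region, point):
                    hits.append((core, point, other))
    return hits


# Hypothetical placement: two cores with disjoint regions.
regions = {"a": (0, 0, 3, 3), "b": (5, 0, 8, 3)}
wires = {"a": {(1, 1), (6, 2)}, "b": {(6, 1)}}
intrusions = find_intrusions(regions, wires)
```

Here one wire point of core `a` falls inside the region of core `b`, so the placement would be rejected or rerouted under this simplified model.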

7  Conclusion 

The design of reconfigurable systems is a complex process, with
multiple software tool chains that may have different trust levels.
Since it is not cost-effective to develop an optimized tool chain from
scratch to meet assurance needs, only the most sensitive cores should
be designed using a trusted tool chain. To meet performance needs,
most cores could be designed with commodity tools that are highly
optimized but untrusted, which results in multiple cores on a chip
with different trust levels. Our methodology will not make those less
trusted portions more dependable or correct, but it will isolate
trusted portions from the effects of their subversion or failure. To
address this situation, developers will need to build monitored or
fail-safe systems on top of FPGAs to prevent the theft of critical
secrets.

We have presented two low-level protection mechanisms to address
these challenges, moats and drawbridges, and we have analyzed the
trade-offs of each. Although larger moats consume more area than
smaller moats, they have better performance because longer segments
can be used. Our interconnect tracing primitive works together with
our moat primitive in a complementary way by allowing smaller moats to
be used without sacrificing performance. We have also described how
these basic primitives are useful in the implementation of a
higher-level memory protection primitive, which can prevent unintended
sharing of information in embedded systems.